We replicate some of the results from paper Predicting the Present with Google Trends.. We focused on 3.1 Motor vehicles and parts. We were able to replicate everything in that session using original data. There were many difficulties on finding the right data.
The sales data: We first tried the link provided in the paper, but the data is adjusted. We were able to find the unadjusted data through this link. However, the values were different from the original data(used by the authors of the paper). The difference increases as year increases. The reason for that is still unclear. We will compare the two data set in the next session.
The search data: Trucks & SUVs and Auto Insurance
We noticed that the sales values from the data we obtained was always less than the data used in the paper.
# load the data
sales <- read_excel("sales.xls")
merged <- read_csv("merged.csv")
# Period has type char in the data, convert that to yearmonth
ym <- as.yearmon(sales$Period, "%b-%Y")
# use as.Date() to convert the type of ym to date
sales$Period <- as.Date(ym)
#keep only data from 01/2004 through 07/2011 and renames the data frame as unadj, rename the column Value as sales,and add label "unadjusted"
unadj<- sales %>%
filter(Period <="2011-07-01") %>%
mutate(label = "unadjusted") %>%
rename(sales = Value)
# convert sales to numeric
unadj$sales <-as.numeric(unadj$sales)
# add label to original data, keep only Period, sales and label
orig <- merged %>%
mutate(label = "original") %>%
select(-insurance, -suvs)
# stack two data frame
com_data <- rbind(unadj, orig)
#plot log(sales) Vs Period
ggplot(com_data, aes(x=Period, y = log(sales), color = label)) +
geom_line()
# plot orignal against unadjusted
joined_data <- unadj %>% select(-label) %>% rename(unadjusted_sales = sales) %>% left_join(orig, by ="Period") %>% select(-label)
ggplot(joined_data, aes(x = sales, y = unadjusted_sales)) +
geom_point()+
geom_abline(linetype = "dashed") +
xlab('original') +
ylab('unadjusted')+
ggtitle("original sales against unadjusted sales")
From the first graph, we can see that the sales values from both original and unajusted data are very close to each other(almost overlapping). When I plotted original data against unadjusted data, they almost lined up on the diagonal. Note that original is always less than unadjusted. I wonder if people went back and modified that data after the authors obtained the data. Further investigation is needed. In conclusion, the unadjusted data we found is very close to the original data.
The authors did not specify what kind of transformation/normalization/standardization they performed on Google Trends data. Below are what we have tried to transform trends data.
None of these match with the original data. Therefore, we decided to process without any transformation on Google Trends data.
We also tried other key words like “Buy car”, “car dealer”, “Vehicle shop” and so on, but none of these were significant to the model
We would like to compare the outputs of unadjusted data with the outputs from the original data with same trends data.
#take the log of sales, take out the label column
unadj_rep <- unadj %>% mutate(sales=log(sales)) %>% select(-label)
# google_trends contains trends data and Period for join
google_trends <- merged %>% select(-sales)
# join trends to unadj_rep
unadj_with_trends <- unadj_rep %>% left_join(google_trends, by = "Period")
# replicate baseline model
model1 <- lm(data = unadj_with_trends, sales~lag(sales, 1)+lag(sales,12))
#the summary of the model
#summary(model1)
# replicate the model with trends
model_with_trend <- lm(data = unadj_with_trends, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
#summary(model_with_trend)
# replicate figure 2
# creating base
base_unajusted <- unadj_with_trends
for (i in 18:91){
merged_t <- unadj_with_trends[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base_unajusted$sales[i] <- predict(model1,unadj_with_trends[1:i,])[i]
}
base_unajusted <- base_unajusted[18:91,]
# creating trends
trends_unajusted <- unadj_with_trends
for (i in 18:91){
merged_t <- unadj_with_trends[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
trends_unajusted$sales[i] <- predict(model1,unadj_with_trends[1:i,])[i]
}
trends_unajusted <- trends_unajusted[18:91,]
# label different data sets
actual_unajusted <- unadj_with_trends[18:91,]
actual_unajusted <- actual_unajusted %>%
mutate(label ="actual_unajusted")
base_unajusted <- base_unajusted %>%
mutate(label = "base_unajusted")
trends_unajusted <- trends_unajusted %>%
mutate(label ="trends_unajusted")
# R^2 for base
r_base <- (cor(base_unajusted$sales,actual_unajusted$sales))^2
# R^2 for trends
r_trend <- (cor(trends_unajusted$sales,actual_unajusted$sales))^2
# stack all data sets for plotting
plot_data <- rbind(actual, base, trends, actual_unajusted, base_unajusted, trends_unajusted)
ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black","black", "red","red","grey","grey"))+
scale_linetype_manual(values = c("solid","solid","dashed", "dashed", "solid","solid"))+
ylab('log(mvp)')+
xlab('Index')
From the graph, we see that lines are almost overlapped, and \(R^2\) are very close to the ones from original data, 0.7266 for the base and 0.7911028 for the one with trends.
The current month’s trends data was significant to the model even without the transformation during the time period 01/01/2004 to 07/01/2011. We used current month’s trends data as an estimator of the first week of the month since weekly data was not available.
# trend 1
# search data in category Trucks & SUVs between 01/01/2004 and 07/01/2011
suvs_trends <- read_csv("New_Trucks_suvs.csv")
names1<- names(suvs_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
# trend 2
# search data in category Auto Insurance between 01/01/2004 and 07/01/2011
insurance_trends <- read_csv("New_auto_insurance.csv")
names2 <- names(insurance_trends)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
# census data
unadj_full <- sales %>% filter(Period <= "2011-07-01")
#join trends data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
# join all data
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
# take the log of sales
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))
# summary of model1
tidy(model1)
m1 <- glance(model1)
# model with previous month's trends data
model_with_trend_pre_month <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) +lag(suvs,1) + lag( insurance,1))
tidy(model_with_trend_pre_month)
mtp<-glance(model_with_trend_pre_month)
# model with current month's trends(we don't have weekly data, so we are using the monthly data instead)
model_with_trend_current <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
# summary table of the model
# summary(model_with_trend)
tidy(model_with_trend_current)
mtc<-glance(model_with_trend_current)
The \(R^2\) for base model is 0.7212228, the \(R^2\) for the model with trends data from previous month is 0.7402116, and the \(R^2\) for the model with trends data from current month is 0.7830012. We see an improvement in overall fitness when trends data were from current month (since we don’t have access to the trends data of the first week of the month, we are using the monthly data as an estimator). We will further investigate if this is true with out-of-sample forecasting.
We evaluated the model by out-of-sample forecasting, and we found that the trends data made the predictions less accurate.
base <- with_trends_full
# creating predicted base with rolling window nowcast
for (i in 18:91){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[18:91,]
# creating predicted trends with rolling window nowcast
trends <- with_trends_full
for (i in 18:91){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[18:91,]
# Make the graph
actual <- with_trends_full[18:91,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
# means absolute error
MAE_base <- mean(abs(base$sales-actual$sales))
MAE_trends <- mean(abs(trends$sales-actual$sales))
# data for recession period Dec 2007 to June 2009
recession_trends <- trends %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base <- base %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual <- actual %>%
filter(Period>="2007-12-01"& Period<="2009-06-01")
MAE_recession_trends <- mean(abs(recession_trends$sales-recession_actual$sales))
MAE_recession_base <- mean(abs(recession_base$sales-recession_actual$sales))
# Overall improvement
overall_improv <- (mean(abs(base$sales-actual$sales))-mean(abs(trends$sales-actual$sales)))/mean(abs(base$sales-actual$sales))
# recession improvement
recession_improv <- (mean(abs(recession_base$sales-recession_actual$sales))-mean(abs(recession_trends$sales-recession_actual$sales)))/mean(abs(recession_base$sales-recession_actual$sales))
#R^2 for base
r_sqr_base <-(cor(base$sales,actual$sales))^2
# R^2 for trends
# r_sqr_trends <- (cor(trends$sales[3:73],actual$sales[3:73]))^2
r_sqr_trends <- (cor(trends$sales,actual$sales))^2
#We can use MAE() from caret package to calculate MAE
#
# library(caret)
#
# # R native funcitons
# MAE(recession_trends$sales, recession_actual$sales)
# MAE(recession_base$sales, recession_actual$sales)
#
# MAE(trends$sales, actual$sales)
# MAE(base$sales, actual$sales)
The MAE for base model is 0.0636592, and the MAE for the model with trends is 0.0668723. The overall improvement is -5.0473494%. That means adding trends to the model make things worse. We see a totally different story when focus on recession period. The MAE for base model is 0.0890323, and the MAE for the model with trends is 0.0679496 during the recession. There is 23.6798011% improvement after adding trends to the model.
The model produced higher \(R^2\) for time period 08/01/2011 to 12/01/2019. However, the trends data did not improve the performance.
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
start_month <- "2011-07-01" # "2004-01-01"
end_month <- "2019-12-01" # "2020-05-01"
unadj_full <- sales %>% filter(Period <= end_month & Period >start_month)
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))
#the summary of the model
summary(model1)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = with_trends_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07213 -0.01791 0.00315 0.02055 0.06457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.56060 0.28128 5.548 3.14e-07 ***
## lag(sales, 1) 0.01167 0.03899 0.299 0.765
## lag(sales, 12) 0.85541 0.03484 24.550 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02794 on 86 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9485, Adjusted R-squared: 0.9473
## F-statistic: 792.4 on 2 and 86 DF, p-value: < 2.2e-16
# model with trends
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
# summary table of the model
summary(model_with_trend)
##
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + suvs +
## insurance, data = with_trends_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07066 -0.01718 -0.00007 0.01403 0.07169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4104669 0.4317867 0.951 0.34452
## lag(sales, 1) 0.0374763 0.0377719 0.992 0.32396
## lag(sales, 12) 0.9351565 0.0405170 23.081 < 2e-16 ***
## suvs -0.0019539 0.0006383 -3.061 0.00296 **
## insurance 0.0020128 0.0010723 1.877 0.06397 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0265 on 84 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9548, Adjusted R-squared: 0.9526
## F-statistic: 443.2 on 4 and 84 DF, p-value: < 2.2e-16
# Rolling window
base <- with_trends_full
len <- nrow(base)
# creating predicted base with rolling window forecast
for (i in 18:len){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12))
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[18:len,]
# creating predicted trends with rolling window forecast
trends <- with_trends_full
for (i in 18:len){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[18:len,]
# Make the graph
actual <- with_trends_full[18:len,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
# R^2 for baseline model
r_sqr_base <-(cor(base$sales,actual$sales))^2
# R^2 for model with trends
r_sqr_trends <- (cor(trends$sales,actual$sales))^2
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
For data between 2011-07-01 and 2019-12-01. We don’t see much improvement on the fitness of the model when trends data was included. When we applied rolling window forecasting, we got \(R^2\) = 0.9375906 for baseline model, and \(R^2\) = 0.9168464 for the model with trends data. This is suggesting that adding trends data to the model may decreases prediction accuracy. We also tried this procedure with data from different time period, the results are similar.
We would to like to explore new models for forecasting. We examined two models. At the end, We will try to predict the sales of July, 2020 using these models.
We wanted to forecast by using the baseline model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) and the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + b_3insurance + b_4SUVs + e_t\) (search data from month t). We used rolling window forecast to evaluate our model(from 01/01/2017 to 01/01/2019) .
## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>%
rename(suvs = names1[2],
Period = Month)
insurance_trends <- insurance_trends %>%
rename(insurance = names2[2],
Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")
##Join data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value)
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)
## starting from 2017
whole_data_2017 <- with_trends_full %>%
filter(Period >="2013-01-01" & Period <="2019-01-01") # we want to start predict 2017-01-01 when it is 2016-10-01, we need data from at least from 2015-10-01 (We want more data before first prediction to train a good model)
months <- nrow(whole_data_2017)
base <- whole_data_2017
start <- 46 # the index for 2016-10-01
# creating predicted base with rolling window nowcast
for (i in start:months){
merged_t <- whole_data_2017[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14))
base$sales[i] <- predict(model1,whole_data_2017[1:i,])[i]
}
base <- base[start:months,]
# creating predicted trends with rolling window nowcast
trends <- whole_data_2017
for (i in start:months){
merged_t <- whole_data_2017[1:i-1,]
model1 <- lm(data = merged_t, sales~lag(sales, 2)+lag(sales,14)+ lag(suvs,2) + lag(insurance,2)) # we used search data from the month when we were making the prediction
trends$sales[i] <- predict(model1,whole_data_2017[1:i,])[i]
}
trends <- trends[start:months,]
# Make the graph
actual <- whole_data_2017[start:months,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
# MAE for trends
MAE(trends$sales, actual$sales)
## [1] 0.06306789
#MAE for baseline model
MAE(base$sales, actual$sales)
## [1] 0.06151782
# improvement
(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)
## [1] -0.02519705
#R^2 for baseline model
(cor(base$sales,actual$sales))^2
## [1] 0.01782038
# R^2 for trends
(cor(trends$sales,actual$sales))^2
## [1] 0.01445199
Both models performed badly. We were curious about the reason. We plotted Period VS sales for data from 01/01/2004 through 05/01/2020.
# yearly trends
ggplotly(ggplot(with_trends_full, aes(x = Period, y = exp(sales))) + geom_line())
From this graph, we observed annual trend. The beginning and the end of a year tend to be the lowest point, mid-year tends to be the highest point, and the overall sales is increasing (exceptions during the recession and 2020). This explains why the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) performed poorly. In this model, we used data from current month and 12 months ago to predict what would happen two month later. This violates the annual trend the sales data obeys.
Inspired by the annual trend, we want to test a different model for forecasting. \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\) Since there is a strong annual trend, we wonder if use the data from previous years of the same month as a predictors will give us a better model.
we used data from 2012-01-01 through 2018-12-01 to train the model, and predict what would the sales be for the whole year of 2019.
Note: we tried to use data from different time periods as training data for the model. It appears that use data from previous 6 years to train the model is the best.
evaluate model by Out-of-sample forecasting
#Start making prediction at 2010-01-01
trends_model <- sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12)
base_model <- sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)
start <- 73 # the index for 2010-01-01
months <- nrow(with_trends_full)
# creating predicted base with rolling window nowcast
base <- with_trends_full
for (i in start:months){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, base_model)
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[start:months,]
# creating predicted trends with rolling window nowcast
trends <- with_trends_full
for (i in start:months){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, trends_model) # we used search data from the month when we were making the prediction
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[start:months,]
# Make the graph
actual <- with_trends_full[start:months,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
#R^2 for baseline model
rsb <- paste("Base: R^2 = ", round((cor(base$sales,actual$sales))^2,digits = 4))
# R^2 for trends
rst <- paste("Trend: R^2 = ", round((cor(trends$sales,actual$sales))^2,digits = 4))
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
ggtitle(paste0(rsb,"\n",rst))+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
#Start making prediction at 2010-01-01
trends_model <- sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12)
base_model <- sales~lag(sales, 12)+lag(sales, 48)
start <- 150 # the index for 2010-01-01
months <- nrow(with_trends_full)
# creating predicted base with rolling window nowcast
base <- with_trends_full
for (i in start:months){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, base_model)
base$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
base <- base[start:months,]
# creating predicted trends with rolling window nowcast
trends <- with_trends_full
for (i in start:months){
merged_t <- with_trends_full[1:i-1,]
model1 <- lm(data = merged_t, trends_model) # we used search data from the month when we were making the prediction
trends$sales[i] <- predict(model1,with_trends_full[1:i,])[i]
}
trends <- trends[start:months,]
# Make the graph
actual <- with_trends_full[start:months,]
actual <- actual %>%
mutate(label ="actual")
base <- base %>%
mutate(label = "base")
trends <- trends %>%
mutate(label ="trends")
plot_data <- rbind(actual, base, trends)
#R^2 for baseline model
rsb <- paste("Base: R^2 = ", round((cor(base$sales,actual$sales))^2,digits = 4))
# R^2 for trends
rst <- paste("Trend: R^2 = ", round((cor(trends$sales,actual$sales))^2,digits = 4))
ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
ggtitle(paste0(rsb,"\n",rst))+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(mvp)')+
xlab('Index'))
# Used data from 2012-01-01 through 2018-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2012-01-01"& Period <="2018-12-01")
# I'm using trends data from previous year of the same month for training
model_with_trend_all <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
#summary(model_with_trend_all)
model_with_trend_2 <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
summary(model_with_trend_2)
##
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48) + lag(insurance,
## 12) + lag(suvs, 12), data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04722 -0.01169 -0.00021 0.01387 0.04380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0243068 0.6480123 1.581 0.12410
## lag(sales, 12) 0.5722287 0.1214147 4.713 4.88e-05 ***
## lag(sales, 48) 0.3559760 0.1129101 3.153 0.00358 **
## lag(insurance, 12) -0.0006713 0.0020917 -0.321 0.75039
## lag(suvs, 12) -0.0010073 0.0012108 -0.832 0.41182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02284 on 31 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.9075, Adjusted R-squared: 0.8956
## F-statistic: 76.07 on 4 and 31 DF, p-value: 1.413e-15
model_without_trend <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48))
summary(model_without_trend)
##
## Call:
## lm(formula = sales ~ lag(sales, 12) + lag(sales, 48), data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061527 -0.013950 0.000808 0.014460 0.055144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7749 0.6770 2.622 0.0131 *
## lag(sales, 12) 0.7189 0.1272 5.651 2.7e-06 ***
## lag(sales, 48) 0.1306 0.1004 1.301 0.2023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02573 on 33 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.8751, Adjusted R-squared: 0.8675
## F-statistic: 115.6 on 2 and 33 DF, p-value: 1.246e-15
# we want to predict whole year of 2019, so we need data from 2015-01-01 and after
#Change here to see the graph of that model
model <- model_with_trend_2
test_data <- with_trends_full %>%
filter(Period >="2015-01-01" & Period <= "2019-12-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(!(is.na(pred)) )
MSE <- mean((predictions$pred -predictions$sales)^2)
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.9379584
We tried three versions of this model, we found that model \(y_t=b_1y_{t-12}b_4y_{t-48}+b_5insurance + b_6SUVs\) makes better predictions for year 2019. MSE = 6.166375610^{-4} and \(R^2\) = 0.9379584. Since this model works especially well, I would like to know its performance on other years.
\(R^2\) = 0.9712654 for the predictions on 2015. The lowest \(R^2\) was 0.8812861 on 2017’s predictions for years from 2018-2014.
Test the model on 2018
# Used data from 2011-01-01 through 2017-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2011-01-01"& Period <="2017-12-01")
model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
test_data <- with_trends_full %>%
filter(Period >="2014-01-01" & Period < "2019-01-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(!(is.na(pred)) )
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.903958
Test the model on 2017
# Used data from 2010-01-01 through 2016-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2010-01-01"& Period <="2016-12-01")
model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
test_data <- with_trends_full %>%
filter(Period >="2013-01-01" & Period < "2018-01-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(!(is.na(pred)) )
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.8812861
Test the model on 2016
# Used data from 2009-01-01 through 2015-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2009-01-01"& Period <="2015-12-01")
model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
test_data <- with_trends_full %>%
filter(Period >="2012-01-01" & Period < "2017-01-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(!(is.na(pred)) )
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.9002892
Test the model on 2015
# Used data from 2008-01-01 through 2014-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2008-01-01"& Period <="2014-12-01")
model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
test_data <- with_trends_full %>%
filter(Period >="2011-01-01" & Period <= "2015-12-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(!(is.na(pred)) )
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.9712654
Test the model on 2014
# Used data from 2007-01-01 through 2014-12-01 to train the model
training_data <- with_trends_full %>%
filter(Period >="2007-01-01"& Period <="2013-12-01")
model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
test_data <- with_trends_full %>%
filter(Period >="2010-01-01" & Period < "2017-01-01")
predictions <- test_data %>%
add_predictions(model) %>%
filter(Period >= "2014-01-01" )
ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
geom_point() +
geom_abline()+
xlab("Sales")+
ylab("Prediction"))
(cor(predictions$sales, predictions$pred))^2
## [1] 0.916022
In conclusion, this model works well when predicting other years as well.
We want to predict June and/or July with the models we examined above. The tricky part is that how much data should we use to train the model. We believe that using more recent data would be more helpful. To decide what data we would use to train the model, we were changing the training data and test it by predicting the sales of 2019. When we are satisfied with the model, we use it to predict June and/or July. We don’t expect the prediction to be accurate since 2020 is such a special year
# We are making prediction of 2019(whole year) at the begining of 2019 when the data for 2018-12 is available.
# played with different start month, 2016 gives the higher R^2
start <- "2016-01-01"
end <- "2018-12-01"
training_data <- with_trends_full %>%
filter(Period >=start& Period <=end)
model1 <- lm(data = training_data, sales~lag(sales, 1)+lag(sales,12))
testing_data <- with_trends_full %>% filter(Period >="2018-01-01")# 12 month before 2019-01-01
# summary(model1)
predicted <- testing_data %>%
add_predictions(model1) %>%
filter(!(is.na(pred)))
MSE <- mean((predicted$sales -predicted$pred)^2)
(cor(predicted$sales, predicted$pred))^2
## [1] 0.07932483
n <- tidy(model1)
#prediction for June 2020
sales_2020_05 <- as.numeric(with_trends_full %>% filter(Period == "2020-05-01") %>% select(sales))
sales_2019_06 <- as.numeric(with_trends_full %>% filter(Period == "2019-06-01") %>% select(sales))
June <- exp(n$estimate[1] +n$estimate[2]*sales_2020_05+n$estimate[3]*sales_2019_06)
June
## [1] 106580
# prediction for July
sales_2019_07 <- as.numeric(with_trends_full %>% filter(Period == "2019-07-01") %>% select(sales))
July <- exp(n$estimate[1] +n$estimate[2]*log(June)+n$estimate[3]*sales_2019_07)
July
## [1] 110945.9
start <- "2016-01-01"
end <- "2018-12-01"
training_data <- with_trends_full %>%
filter(Period >=start& Period <=end)
model1 <- lm(data = training_data, sales~lag(sales, 2)+lag(sales,14) + insurance + suvs)
testing_data <- with_trends_full %>% filter(Period >="2017-10-01")# 14 month before 2019-01-01
summary(model1)
##
## Call:
## lm(formula = sales ~ lag(sales, 2) + lag(sales, 14) + insurance +
## suvs, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.136168 -0.010870 0.003921 0.026308 0.078363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.007491 2.006196 5.487 4.01e-05 ***
## lag(sales, 2) 0.570992 0.526575 1.084 0.293
## lag(sales, 14) -0.574034 0.527727 -1.088 0.292
## insurance 0.000679 0.003944 0.172 0.865
## suvs 0.005389 0.003543 1.521 0.147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05641 on 17 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2592, Adjusted R-squared: 0.08492
## F-statistic: 1.487 on 4 and 17 DF, p-value: 0.25
predicted <- testing_data %>%
add_predictions(model1) %>%
filter(!(is.na(pred)))
MSE <- mean((predicted$sales -predicted$pred)^2)
(cor(predicted$sales, predicted$pred))^2
## [1] 0.09990959
n <- tidy(model1)
#prediction for June 2020
sales_2020_05 <- as.numeric(with_trends_full %>% filter(Period == "2020-05-01") %>% select(sales))
sales_2019_05 <- as.numeric(with_trends_full %>% filter(Period == "2019-05-01") %>% select(sales))
June <- exp(n$estimate[1] +n$estimate[2]*sales_2020_05+n$estimate[3]*sales_2019_05)
June
## [1] 56090.57
We use this model to predicting the sales for every month in 2020 since this model only depends on year from previous years which is available to us now.
start <- "2013-01-01"
end <- "2019-12-01"
training_data <- with_trends_full %>%
filter(Period >=start& Period <=end)
model1 <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
# testing data has period up to 2020-12-01, these are place holders for predicted values
data_for_predict <- sales %>%
select(Period) %>%
left_join(with_trends_full, by = "Period") %>%
filter(Period >="2016-01-01")# 48 month before 2020-01-01
#summary(model1)
predicted <- testing_data %>%
add_predictions(model1) %>%
filter(!(is.na(pred)))
pre <- predicted %>% select(Period,sales,pred) %>% mutate(sales= exp(sales), pred=exp(pred))
knitr::kable(pre,format = "markdown")
| Period | sales | pred |
|---|
In conclusion, the hardest part of replicating results of published paper is obtaining the data. It was fun!